In this blog post, we are going to talk about a few tools for measuring how important each feature is to a model's predictions.

Permutation Importance: Math and Intuition

Everyone loves tree-based models. Gradient boosting, random forests, and friends are wonderful, flexible tools. Another benefit of these models, because of their tree-ness, is that we can actually see how “important” each variable is in the decisions the model makes. This is also one of the many reasons why we love linear models: we can see and quantify the strength of a feature in our model. However, why must we limit ourselves to just linear and tree-based models?

Let's try to think of a new approach to get variable importance. When I was first really getting into ML, I remember asking one of my professors: “How much time do you spend on feature engineering?”. I will never forget his answer. He told me: “Feature engineering [is] the most crucial part to improve both accuracy and model generalization. If you have an unnecessary feature in the model, you are in essence fitting noise.” This has stuck with me for a long time, and it is a useful thing to keep in mind while discussing permutation importance.

If unnecessary features just add noise that decreases model accuracy and generalization, what happens if we replace a good feature with noise? Our model should be inherently worse, no? This leads us to the key question of permutation importance:

If I replace a feature with noise, how much worse does the model perform?

This is the key idea of permutation-based variable importance. Calculating it takes four simple steps:

  1. Calculate prediction loss

  2. Replace a feature with noise

  3. Recalculate prediction loss

  4. Compare
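Those four steps can be sketched in just a few lines. Below is a minimal illustration (not the full implementation we build later) using scikit-learn on synthetic data; the model, data, and loss here are stand-ins for whatever you are using:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# synthetic data: y depends strongly on column 0, not at all on column 1
X = rng.normal(size=(500, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X, y)

# 1. calculate prediction loss
base_loss = mean_squared_error(y, model.predict(X))

# 2. replace a feature with noise (here: permute column 0)
X_perm = X.copy()
X_perm[:, 0] = rng.permutation(X_perm[:, 0])

# 3. recalculate prediction loss
perm_loss = mean_squared_error(y, model.predict(X_perm))

# 4. compare: a ratio far above 1 means the feature mattered
importance = perm_loss / base_loss
```

Permuting the truly important column should blow the loss up by orders of magnitude, while permuting the irrelevant one would barely move it.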

Formalization

Let's formulate permutation importance mathematically now! First, let's define our data as the set \(x\), with \(m\) observations and \(n\) features. Next, let's consider two subsets of \(x\): \(x_s\) and \(x_c\). \(x_s\) represents the feature(s) we are interested in, and \(x_c\) represents the complement of \(x_s\) (in English, everything else). Thus:

\[x = (x_s, x_c)\]

Let's first define the original loss, computed with the original features, as \(\mathcal{L}\), where \(\hat f\) is our fitted prediction function and \(y\) is the response:

\[ \mathcal{L} = \mathrm{loss}\left(y, \hat f \left(x_s, x_c \right) \right) \]

Next, we need to replace \(x_s\) with noise. To do that, we want to sample from the marginal distribution of \(x_s\), meaning the distribution of \(x_s\) independent of the other features. With a reasonably sized dataset, we can simply permute \(x_s\) for more or less the same result. We will denote the permutation of \(x_s\) as \(x_s^*\). Next, let's define the loss \(\mathcal{L}^*\) obtained when our model \(\hat f\) is given the permuted feature: \[ \mathcal{L}^* = \mathrm{loss}\left(y, \hat f \left(x_s^*, x_c \right) \right) \]

Finally, we can calculate the variable importance of \(x_s\):

\[ \mathrm{VIP}_{\mathrm{perm}}(x_s) = \frac{\mathcal{L}^*}{\mathcal{L}} \]

There we go! It's that simple! An addendum suggested by Jeremy Howard of fast.ai: add a feature of pure noise and see how important it is, for reference.
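As a quick sketch of that addendum (with a made-up toy dataset, purely for illustration): a pure-noise column gives a reference point that real features should comfortably beat.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# toy frame: one real signal plus a column of pure noise for reference
X = pd.DataFrame({"signal": rng.normal(size=300),
                  "noise": rng.normal(size=300)})
y = 2 * X["signal"] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
base = mean_squared_error(y, model.predict(X))

def perm_ratio(col):
    # permutation-importance ratio for a single column
    X_tmp = X.copy()
    X_tmp[col] = rng.permutation(X_tmp[col].to_numpy())
    return mean_squared_error(y, model.predict(X_tmp)) / base

# "signal" should score well above the pure-noise reference
ratios = {c: perm_ratio(c) for c in X.columns}
```

Any feature whose ratio lands near the noise column's is a candidate for removal.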

Code Implementation

For our data, we will use the dataset discussed in Benjamin Tayo’s amazing blog post. We use this dataset because it challenges us with highly correlated features, which will expose some of the pitfalls of these techniques.

The goal with this dataset is to predict the number of crew members on a cruise ship, given some parameters describing the ship. The independent variables are fairly self-explanatory. First, let's read the data into Python and do a train/test split:

import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston, fetch_california_housing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error as loss_mse
import math
import statistics as stats
import matplotlib.cm as cm
from pprint import pprint

# so this is readable:
import warnings
from sklearn.exceptions import DataConversionWarning, ConvergenceWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=ConvergenceWarning)
cruise = pd.read_csv("https://github.com/bot13956/ML_Model_for_Predicting_Ships_Crew_Size/raw/master/cruise_ship_info.csv")


X = cruise.loc[:, cruise.columns != "crew"]
X = X.loc[:, X.columns != "Ship_name"]
X = X.loc[:, X.columns != "Cruise_line"]
y = cruise.loc[:, cruise.columns == "crew"]


def split(df, p_train = 0.75, random_state = 0):
    train = df.sample(frac = p_train, random_state = random_state)
    test = df.drop(train.index)
    return(train, test)

(X_train, X_test), (y_train, y_test) = (split(x) for x in [X, y])

Next, let's use the amazing reticulate package to pass these exact data frames into R:

X <- py$X
y <- py$y
X_train <- py$X_train
X_test <- py$X_test
y_train <- py$y_train
y_test <- py$y_test
X

Now we are all set up to implement permutation importance in both Python and R.





In Python

First, let's set up three models to test: a linear model, a k-nearest neighbors regressor, and a random forest:

lm = LinearRegression()
knn = KNeighborsRegressor(13) 
rf = RandomForestRegressor(n_estimators = 100)

models = [lm, knn, rf]

for m in models:
  m.fit(X_train, y_train)
#> LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
#> KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
#>                     metric_params=None, n_jobs=None, n_neighbors=13, p=2,
#>                     weights='uniform')
#> RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
#>                       max_features='auto', max_leaf_nodes=None,
#>                       min_impurity_decrease=0.0, min_impurity_split=None,
#>                       min_samples_leaf=1, min_samples_split=2,
#>                       min_weight_fraction_leaf=0.0, n_estimators=100,
#>                       n_jobs=None, oob_score=False, random_state=None,
#>                       verbose=0, warm_start=False)

As a baseline truth, let's check which features the random forest considers important, and the strength of the predictors in our linear regression:

pprint(dict(zip(X_train.columns, rf.feature_importances_)))
#> {'Age': 0.017940573693502024,
#>  'Tonnage': 0.41592715941929437,
#>  'cabins': 0.4194355080754462,
#>  'length': 0.07782246312009274,
#>  'passenger_density': 0.014576681398860572,
#>  'passengers': 0.0542976142928041}
pprint({X_train.columns[i]: lm.coef_[:,i] for i in range(lm.coef_.shape[1])})
#> {'Age': array([-0.01321228]),
#>  'Tonnage': array([0.00316145]),
#>  'cabins': array([0.79858261]),
#>  'length': array([0.39111394]),
#>  'passenger_density': array([0.01037499]),
#>  'passengers': array([-0.1045376])}

Looks like tonnage and cabins are the most important variables for the random forest, and cabins, length, and passengers for linear regression.

Next, let's create a function for permutation importance! This should be pretty easy to do; I have added a few extra components for thoroughness, but the code is not hard:

from sklearn.base import clone  # clone copies an estimator's hyperparameters

def permutation_importance(model, x, y, loss, base = False, x_train = None, y_train = None, kind = "prop", n_rounds = 5):
    explan = x.columns
    # loss with the original, unpermuted features
    baseline = loss(y, model.predict(x))
    res = {k: [] for k in explan}
    if base:
        res["baseline"] = []
    for n in range(n_rounds):
        for col in explan:
            x_temp = x.copy()
            x_temp[col] = np.random.permutation(x_temp[col])
            if kind != "prop":
                res[col].append(loss(y, model.predict(x_temp)) - baseline)
            else:
                res[col].append(loss(y, model.predict(x_temp)) / baseline)
        if base:
            # refit an identical model with an extra pure-noise column
            x_temp = x.copy()
            x_train2 = x_train.copy()
            x_temp["baseline"] = np.clip(np.random.normal(size = len(x_temp)), -1., 1.)
            x_train2["baseline"] = np.clip(np.random.normal(size = len(x_train2)), -1., 1.)
            mod2 = clone(model)
            mod2.fit(x_train2, y_train)
            if kind != "prop":
                res["baseline"].append(loss(y, mod2.predict(x_temp)) - baseline)
            else:
                res["baseline"].append(loss(y, mod2.predict(x_temp)) / baseline)
    return(pd.DataFrame.from_dict(res))

Lets also define a helper function to help us do this for all our models, and then go ahead and calculate the importances!

# convert object name to string!
def get_name(obj):
    name =[x for x in globals() if globals()[x] is obj][0]
    return(name)

imps = {}
for m in models:
    imps[get_name(m)] = permutation_importance(m, X_test, y_test, loss_mse, True,  X_train, y_train, n_rounds = 5)
pprint(imps)
#> {'knn':         Age    Tonnage  passengers  ...    cabins  passenger_density  baseline
#> 0  0.953594  10.460554    1.363738  ...  0.977356           1.080228  1.310102
#> 1  0.913625  20.251634    1.070447  ...  1.059398           1.287608  1.318981
#> 2  1.029486  19.060316    1.190654  ...  0.934822           1.397907  1.312979
#> 3  1.054564  24.971323    1.262756  ...  0.979389           1.595057  1.306668
#> 4  0.780959  17.193917    1.113051  ...  1.014168           1.219071  1.309752
#> 
#> [5 rows x 7 columns],
#>  'lm':         Age   Tonnage  passengers  ...     cabins  passenger_density  baseline
#> 0  0.943507  1.157280    7.880352  ...  72.529664           1.148155  1.042532
#> 1  0.953535  0.956423    7.847115  ...  75.157145           1.049876  1.000712
#> 2  1.151137  1.027036    7.383386  ...  69.412267           0.942257  0.992352
#> 3  1.079984  0.969374    7.203343  ...  58.536218           1.070731  1.072106
#> 4  1.013006  1.057246    9.071541  ...  80.655969           1.154589  1.035966
#> 
#> [5 rows x 7 columns],
#>  'rf':         Age    Tonnage  passengers  ...     cabins  passenger_density  baseline
#> 0  1.391592  14.571149    2.146399  ...  21.261009           1.461168  1.619377
#> 1  1.132233  15.763448    2.168587  ...  20.955786           1.511130  1.610004
#> 2  1.385285  11.477794    1.971031  ...  13.162853           1.806996  1.566385
#> 3  1.359806  20.469226    2.993920  ...  21.803024           1.564689  1.655168
#> 4  1.311191  13.343430    1.964030  ...  21.434352           1.573079  1.972894
#> 
#> [5 rows x 7 columns]}

This output is a bit hard to read, so let's write a helper function that averages the importances and plots them nicely!

plt.style.use("ggplot")
def plot_perm_imp(df, ax = None, color = 'blue'):
    # average the rounds: one row of column means
    df1 = df.mean().to_frame().T
    # exclude the baseline column so we can draw it separately
    df_temp = df1.loc[:, df1.columns != 'baseline']
    # melt and sort it for plotting
    df2 = df_temp.melt(var_name = 'variable', value_name = 'importance')
    df2 = df2.sort_values(by = "importance")
    # plot it nicely
    df2.plot(kind = 'barh', x = 'variable', y = 'importance', width = 0.8, ax = ax, color = color)
    # draw a line and an arrow at the noise baseline
    for n in df1.columns:
        if n == "baseline":
            plt.axvline(x = df1[n][0])
            plt.annotate('baseline',
                         xy = (df1[n][0], 1),
                         xytext = (df1[n][0] + 0.4, 3),
                         arrowprops = dict(facecolor = 'black',
                                           shrink = 0.05),
                         bbox = dict(boxstyle = "square", fc = (1,1,1)))


# plot the importance dict!
fig = plt.figure(figsize=(10,10))
for i in range(len(imps.keys())):
    ax = fig.add_subplot(len(list(imps.keys())),1, i+1)
    c = sns.color_palette("hls", i+1)[i]
    plot_perm_imp(imps[list(imps.keys())[i]], ax = ax, color = c)
    ax.set_title(list(imps.keys())[i])
    for tick in ax.yaxis.get_major_ticks():
      tick.label.set_fontsize("x-small")
      tick.label.set_rotation(45)
plt.show()
Permutation Importances for linear regression, knn, and a random forest

Looks like it worked alright! For the random forest especially, we nicely identified the two most important features, though with little granularity. With our linear model it is a little less clear, but the same three most important features carry through. However, we have no real granularity, nor any sense of the scale or direction of a variable's effect. Permutation importance also carries a general risk with correlated variables: the permuted data often contains unrealistic observations. For example, our permutations may produce tiny boats (in length) that weigh as much as the largest boats (they would sink!). This often leads to unreliable output with many permutation-based tools, which we will also explore in this post.
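To see why permuting a correlated feature produces impossible rows, here is a small illustration with two artificially correlated columns standing in for length and tonnage (the numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# two strongly correlated columns, like a ship's length and tonnage
length = rng.normal(loc=10, scale=2, size=1000)
tonnage = 5 * length + rng.normal(scale=1, size=1000)

corr_before = np.corrcoef(length, tonnage)[0, 1]

# permuting tonnage destroys the relationship: rows now pair small
# lengths with huge tonnages, combinations never seen in reality
tonnage_perm = rng.permutation(tonnage)
corr_after = np.corrcoef(length, tonnage_perm)[0, 1]
```

The model is then asked to predict on rows drawn from a joint distribution it never saw during training, which is exactly where its behavior is least trustworthy.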

In R

First, let's set up three models to test: a linear model, a knn model, and a random forest:
library(randomForest)
library(kknn)

# set up models with parameters
rf <- function(df) {
  return(randomForest(crew ~ ., data = df, ntree = 100))
}

# knn in R is annoying, so we will need to build a consistent API ourselves:

knn <- function(df) {
  res <- list()
  res$train <- df
  res$k <- 13
  return(structure(res, class = "knn"))
}

predict.knn <- function(obj, newdata) {
  out <- kknn(crew ~ ., train = obj$train, test = newdata, k = obj$k)
  return(as.numeric(out$fitted.values))
}

linear_model <- function(df) {
  lm(crew ~ ., data = df)
}

models <- list("lm" = linear_model,"knn"= knn,"rf"= rf)

t_train <- cbind(X_train, y_train)

trained_models <- lapply(models, function(f) f(t_train))

As a baseline truth, let's check which features the random forest considers important, and the strength of the predictors in our linear regression:

pander::pander(summary(trained_models[["lm"]]))
                    Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)         -0.9349    1.878        -0.4977   0.6197
Age                 -0.01321   0.02083      -0.6343   0.5272
Tonnage             0.003161   0.01692      0.1868    0.8522
passengers          -0.1045    0.06831      -1.53     0.1288
length              0.3911     0.1624       2.409     0.01765
cabins              0.7986     0.1058       7.545     1.34e-11
passenger_density   0.01037    0.02618      0.3963    0.6927

Fitting linear model: crew ~ .

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
118            1.089                 0.906     0.9009
pander::pander(importance(trained_models$rf))
                    IncNodePurity
Age                 65.06
Tonnage             371.6
passengers          196.5
length              280.3
cabins              414.8
passenger_density   25.14

Looks like tonnage and cabins are the most important variables for the random forest, and cabins, length, and passengers for linear regression.

Next, let's create a function for permutation importance! This should be pretty easy to do; I have added a few extra components for thoroughness, but the code is not hard:

# loss function
loss_mse <- function(truth, preds) {
  error <- truth - preds
  square_error <- error^2
  return(mean(square_error))
}

# get baseline loss
get_loss <- function(model, x, y, loss) {
  loss(y, predict(model, x))
}


# permute a single column
permute_column <- function(df, col) {
  df[[col]] <- df[[col]][sample(1:nrow(df))]
  return(df)
}


permutation_importance <- function(model, x, y, loss, n_rounds = 5) {
  baseline <- get_loss(model, x, y[[1]], loss)
  explan <- names(x)
  # losses for one round of permuting each column
  single_round_imp <- function(df) {
    dfs <- lapply(explan, function(i) permute_column(df, i))
    return(vapply(dfs, function(d) get_loss(model, d, y[[1]], loss), numeric(1)))
  }
  res <- lapply(seq_len(n_rounds), function(i) single_round_imp(x))
  res <- as.data.frame(do.call(rbind, res))
  res <- as.data.frame(lapply(res, function(l) l / baseline))
  names(res) <- names(x)
  return(as.data.frame(lapply(res, mean)))
}

# we can do this in a second because R is speedy 
perm_vips <- lapply(trained_models, function(x) permutation_importance(x, X_test, y_test, loss_mse, n_rounds = 30))
pander::pander(perm_vips)
  • lm:

    Age Tonnage passengers length cabins passenger_density
    1.037 1.053 7.129 4.361 68.85 1.086
  • knn:

    Age Tonnage passengers length cabins passenger_density
    1.267 1.829 2.166 3.107 2.692 1.285
  • rf:

    Age Tonnage passengers length cabins passenger_density
    1.189 6.172 3.086 4.436 10.42 1.406

So, it looks like our importances are quite similar to the importances calculated explicitly by the models, so not a bad job! Let's go ahead and produce a nice plot!

library(tidyverse)
vip_flat <- lapply(perm_vips, gather)

# Finally an appropriate use of superassignment!!
lapply(names(vip_flat), function(x) {
  vip_flat[[x]][["model"]] <<- x
})  %>% invisible

do.call( rbind, vip_flat) %>% as.data.frame %>%
  ggplot(aes(x = key, y = value, fill = model)) + geom_bar(stat = "identity") +
  facet_grid(model ~ ., scales = "free") + coord_flip() + ggthemes::theme_fivethirtyeight() + 
  ggthemes::scale_fill_fivethirtyeight() + ggtitle("Permutation Importance")

Looks like it worked alright! For the random forest especially, we nicely identified the two most important features, though with little granularity. With our linear model it is a little less clear, but the same three most important features carry through. However, we have no real granularity, nor any sense of the scale or direction of a variable's effect. Permutation importance also carries a general risk with correlated variables: the permuted data often contains unrealistic observations. For example, our permutations may produce tiny boats (in length) that weigh as much as the largest boats (they would sink!). This often leads to unreliable output with many permutation-based tools, which we will also explore in this post.

Partial Dependence

Let’s try to think of another way to quantify the effect of a variable. We previously answered the question: “How much worse will my prediction be, given a feature has been replaced with noise?” Now, let's ask a new question:

What will my prediction be, in general, when a feature is at a specific value?

This is Partial Dependence in a nutshell. Stemming from this, we can make the claim:

If a feature is more important, the prediction will vary more with that feature than others.

This is exactly the claim made (and supported) in this excellent paper. Thus, in this section we will calculate two things: Partial Dependence, and Partial Dependence Importance. At a high level, Partial Dependence is calculated like this:

  1. Take your original data, copy it
  2. Replace the feature of interest with the value of the first observation of the feature of interest.
  3. Calculate new prediction vector
  4. Average that
  5. Repeat for all observations

Then we can more or less take the standard deviation of these curves to estimate importance (it's a bit different for categorical variables; see the paper).
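The steps above can be sketched as follows. This is a minimal illustration on synthetic data (the quadratic relationship and model choice are arbitrary), with the curve's standard deviation serving as the importance estimate:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def pdp_curve(model, X, col):
    grid = np.sort(X[:, col])
    curve = []
    for v in grid:                   # 5. repeat for all observations
        X_tmp = X.copy()             # 1. copy the original data
        X_tmp[:, col] = v            # 2. replace the feature with one value
        preds = model.predict(X_tmp) # 3. calculate the new prediction vector
        curve.append(preds.mean())   # 4. average it
    return grid, np.asarray(curve)

# the spread of the curve estimates the feature's importance
grid, curve = pdp_curve(model, X, 0)
importance = curve.std()
```

Since y depends only on column 0, its curve should swing far more than the curves of the irrelevant columns.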

Formalization

Consider again \(x = (x_s, x_c)\)

We can define the partial dependence as the expected value of our prediction function, \(\hat f\), over the distribution of \(x_c\):

\[\mathrm{PDP}(x_s) = E_{x_c} \left[ \hat f (x_s, x_c) \right]\]

\[ = \int \hat f (x_s, x_c)\, \mathbb{P}_c(x_c)\, dx_c\] where \(\mathbb{P}_c\) is the marginal distribution of \(x_c\), \(\mathbb{P}_c(x_c) = \int p(x)\, dx_s\).

We can then approximate this over our finite set of \(m\) observations using the Monte Carlo method:

\[\widetilde {\mathrm{PDP}} (x_s) = \frac{1}{m} \sum_{i=1}^{m} \hat f (x_s, x_c^{(i)})\]

where \(x_c^{(i)}\) is a distinct observation (row) of \(x_c\).

Code Implementation

In Python

First, let's define a function to calculate the partial dependence of a prediction on a single variable:

def pdp_var(model, x, var):
    explan = sorted(x[var])
    preds = []
    for val in explan:
        X_tmp = x.copy()
        # set the whole column to one value of the grid
        X_tmp[var] = val
        preds.append(model.predict(X_tmp))
    # rows index the grid values, columns index the observations
    preds = np.asarray(preds).reshape(len(explan), len(x))
    # average over observations to get the curve at each grid value
    pv = preds.mean(axis = 1)
    return(explan, pv)

Next, let's scale that up to work over an entire data frame!

def pdp_df(model, x):
    # return a dict of data frames
    res = {}
    for c in x.columns:
        d = pd.DataFrame()
        d["Value"], d["Average Prediction"] = pdp_var(model, x, c)
        res[c] = d
    return(res)

Let's now show the partial dependence for just our random forest!


rf_pdp = pdp_df(rf, X_test)

plt.style.use("seaborn-whitegrid")
fig = plt.figure(figsize = (10, 10))
for k in range(0,len(rf_pdp.keys())):
    ax = fig.add_subplot(2,3, k+1)
    #c = cm.Paired(i/len(imps.keys()), 1)
    c = sns.color_palette("hls", k+1)[k]
    df = rf_pdp[list(rf_pdp.keys())[k]]
    sns.lineplot(x = "Value", y = "Average Prediction", ax = ax, color = c, data = df)
    ax.set_title(list(rf_pdp.keys())[k])
plt.subplots_adjust(wspace = 0.5, hspace = 0.5)
plt.show()

Remember, we said that the variability in partial dependence corresponds to the importance of a variable. Let's go ahead and write that up too!

def pdp_importance(model, x):
    pdpdf = pdp_df(model, x)
    v = {k:np.std(pdpdf[k]["Average Prediction"]) for k in pdpdf.keys()}
    return(v)


pdp_imps = {get_name(m):pdp_importance(m, X_test) for m in models}

def plot_pdp_imp(d, ax = None, color = "blue"):
    df = pd.DataFrame()
    df["Variable"] = d.keys()
    df["Importance"] = d.values()
    df.sort_values("Importance").plot(kind = "barh", x = "Variable", y = "Importance", width = 0.8, ax = ax, color = color)

fig = plt.figure(figsize = (10,10))
for i in range(len(pdp_imps.keys())):
    ax = fig.add_subplot(len(list(pdp_imps.keys())),1, i+1)
    #c = cm.Paired(i/len(imps.keys()), 1)
    c = sns.color_palette("hls", i+1)[i]
    plot_pdp_imp(pdp_imps[list(pdp_imps.keys())[i]], ax = ax, color = c)
    ax.set_title(list(pdp_imps.keys())[i])
plt.show()

In R

First, let's define a function to calculate the partial dependence of a prediction on a single variable:

pdp_var <- function(model, x, var){
  x_sorted <- x[order(x[[var]], decreasing = TRUE),]
  # predefine matrix: rows index grid values, columns index observations
  out <- matrix(nrow = nrow(x), ncol = nrow(x))
  for (i in 1:nrow(x)) {
    tmp_df <- x_sorted
    tmp_df[[var]] <- tmp_df[[var]][i]
    out[i,] <- predict(model, tmp_df)
  }
  # average over observations to get the curve at each grid value
  res <- rowMeans(out)
  return(data.frame("value" = x_sorted[[var]], "avg_pred" = res))
}

Next, let's scale that up over the entire data frame!

pdp_df <- function(model, x) {
  pdps <- lapply(colnames(x), function(n) {
    res <- pdp_var(model, x, n)
    res$variable <- n
    return(res)
  })
  return(pdps)
}

(rf_pdp <- do.call(rbind, pdp_df(trained_models[[3]], X_test)))

Finally, let's plot these PDPs!

library(ggthemes)
rf_pdp %>% ggplot(aes(color = variable, y =  avg_pred, x = value)) + 
  geom_line(size = 1.5) +
  facet_wrap(variable ~ ., scales = "free") +
  theme_fivethirtyeight() + scale_color_hc()

Remember, we said that the variability in partial dependence corresponds to the importance of a variable. Let's write that up too! We are going to use a lot of lapply, since we are dealing with a list of lists of data frames and trying to get it quickly into a tidy format for ggplot:

# return the pdp lists for all trained models
pdps <- lapply(trained_models, function(m) pdp_df(m, X_test))

# calculate the standard deviation of the average predictions
sd_pdp <- function(df) {
  imp <- as.numeric(sd(df$avg_pred))
  return(data.frame("importance" = imp, "variable" = df$variable[1]))
}

# we have a list of lists of data frames, so things are going to get a little weird
# We need to apply our function at depth of two, hence the double lapply
pdp_imps <- lapply(pdps, function(x) lapply(x, sd_pdp))

# next we need to clean up: turn each item of our lists into a tidy data frame
pdp_imps <- lapply(pdp_imps, function(x) do.call(rbind, x))

# finally before we can plot, we add a character vector indicating model type
pdp_imps <- lapply(names(pdp_imps), function(x) {
  pdp_imps[[x]]$model <- x
  return(pdp_imps[[x]])
  }
)

# then we turn it all into a nice data frame
pdp_imps <- do.call(rbind, pdp_imps)

pdp_imps %>% ggplot(aes(x = variable, y = importance, fill = model)) + geom_bar(stat = "identity") +
  facet_grid(model ~ ., scales = "free") + coord_flip() + ggthemes::theme_fivethirtyeight() + 
  ggthemes::scale_fill_fivethirtyeight() + ggtitle("Partial Dependence Importance")

Discussion

Something is off!!!